The Impact of Instruction-Level Parallelism on Multiprocessor Performance and Simulation Methodology

Authors

  • Vijay S. Pai
  • Parthasarathy Ranganathan
  • Sarita V. Adve
Abstract

Current microprocessors exploit high levels of instruction-level parallelism (ILP) through techniques such as multiple issue, dynamic scheduling, and non-blocking reads. This paper presents the first detailed analysis of the impact of such processors on shared-memory multiprocessors, using a detailed execution-driven simulator. Using this analysis, we also examine the validity of common direct-execution simulation techniques that employ previous-generation processor models to approximate ILP-based multiprocessors. We find that ILP techniques substantially reduce CPU time in multiprocessors, but are less effective in reducing memory stall time. Consequently, despite the presence of inherent latency-tolerating techniques in ILP processors, memory stall time becomes a larger component of execution time, and parallel efficiencies are generally poorer in ILP-based multiprocessors than in previous-generation multiprocessors. Examining the validity of direct-execution simulators with previous-generation processor models, we find that, with appropriate approximations, such simulators can reasonably characterize the behavior of applications with poor overlap of read misses. However, they can be highly inaccurate for applications with high overlap of read misses. For our applications, the errors in execution time with these simulators range from …% to …% for the most commonly used model and from …% to …% for the most accurate model.

(This work is supported in part by the National Science Foundation under Grant Nos. CCR-…, CCR-…, and CDA-…, and the Texas Advanced Technology Program under Grant No. …. Vijay S. Pai is also supported by a Fannie and John Hertz Foundation Fellowship. Copyright 1997 IEEE. Published in the Proceedings of the Third International Symposium on High-Performance Computer Architecture, February 1997, in San Antonio, Texas, USA. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes, or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE. Contact: Manager, Copyrights and Permissions, IEEE Service Center, 445 Hoes Lane, P.O. Box 1331, Piscataway, NJ 08855-1331, USA.)

1 Introduction

Shared-memory multiprocessors built from commodity microprocessors are expected to provide high performance for a variety of scientific and commercial applications. Current commodity microprocessors improve performance with aggressive techniques to exploit high levels of instruction-level parallelism (ILP). For example, the HP PA-8000, Intel Pentium Pro, and MIPS R10000 processors use multiple instruction issue, dynamic (out-of-order) scheduling, multiple non-blocking reads, and speculative execution. However, most recent architecture studies of shared-memory systems use direct-execution simulators, which typically assume a processor model with single issue, static (in-order) scheduling, and blocking reads.

Although researchers have shown the benefits of aggressive ILP techniques for uniprocessors, there has not yet been a detailed or realistic analysis of the impact of such ILP techniques on the performance of shared-memory multiprocessors. Such an analysis is required to fully exploit advances in uniprocessor technology for multiprocessors. It is also required to assess the validity of the continued use of direct-execution simulation with simple processor models to study next-generation shared-memory architectures.
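To make the contrast concrete, consider a minimal C sketch (ours, not from the paper) with two independent reads; the function and the assumption that a[i] and b[j] fall on different cache lines are purely illustrative:

    /* Two independent read misses. A blocking-read (Simple) processor
     * stalls for the full miss latency at each load, paying roughly two
     * latencies in sequence. A dynamically scheduled processor with
     * non-blocking reads can issue the second load while the first miss
     * is still outstanding, overlapping the two latencies.
     */
    long sum_two(const long *a, const long *b, int i, int j)
    {
        long x = a[i];   /* read miss 1 */
        long y = b[j];   /* read miss 2: independent of x, can issue early */
        return x + y;    /* first instruction that needs both values */
    }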
This paper makes two contributions:

  • This is the first detailed study of the effectiveness of state-of-the-art ILP processors in a shared-memory multiprocessor, using a detailed simulator driven by real applications.
  • This is the first study on the validity of using current direct-execution simulation techniques to model shared-memory multiprocessors built from ILP processors.

Our experiments for assessing the impact of ILP on shared-memory multiprocessor performance show that all our applications see performance improvements from the use of current ILP techniques in multiprocessors. However, the improvements achieved vary widely. In particular, ILP techniques successfully and consistently reduce the CPU component of execution time, but their impact on the memory (read) stall component is lower and more application-dependent. This deficiency arises primarily because of insufficient potential in our applications to overlap multiple read misses, as well as system contention from more frequent memory accesses.

The discrepancy in the impact of ILP techniques on the CPU and read-stall components leads to two key effects for our applications. First, read stall time becomes a larger component of execution time than in previous-generation systems. Second, parallel efficiencies for ILP multiprocessors are lower than with previous-generation multiprocessors for all but one application. Thus, despite the inherent latency-tolerating mechanisms in ILP processors, multiprocessors built from ILP processors actually exhibit a greater potential need for additional latency-reducing or latency-hiding techniques than previous-generation multiprocessors.

Our results on the validity of using current direct-execution simulation techniques to approximate ILP multiprocessors are as follows. For applications where our ILP multiprocessor fails to significantly overlap read-miss latency, a direct-execution simulation using a simple previous-generation processor model, with a higher clock speed for the processor and the L1 cache, provides a reasonable approximation. However, when ILP techniques effectively overlap read-miss latency, all of our direct-execution simulation models can show significant errors for important metrics. Overall, for total execution time, the most commonly used direct-execution technique gave …% to …% error, while the most accurate direct-execution technique gave …% to …% error.

The rest of the paper is organized as follows. Section 2 describes our experimental methodology. Sections … describe and analyze our results, Section … discusses related work, and Section … concludes the paper.

2 Experimental Methodology

The following sections describe the metrics used in our evaluation, the architectures simulated, the simulation environment, and the applications.

2.1 Measuring the Impact of ILP

To determine the impact of ILP techniques in multiprocessors, we compare two multiprocessor systems, ILP and Simple, equivalent in every respect except the processor used. The ILP system uses state-of-the-art high-performance microprocessors with multiple issue, dynamic scheduling, and non-blocking reads; we refer to such processors as ILP processors. The Simple system uses previous-generation microprocessors with single issue, static scheduling, and blocking reads, matching the processor model used in many current direct-execution simulators; we refer to such processors as Simple processors. We compare the ILP and Simple systems to determine how multiprocessors benefit from ILP techniques, rather than to propose any architectural tradeoff between the ILP and Simple architectures. Therefore, both systems have the same clock rate and feature an identical aggressive memory system and interconnect suitable for ILP systems. Section 2.2 provides more detail on these systems.

The key metric we use to evaluate the impact of ILP is the speedup in execution time achieved by the ILP system over the Simple system, which we call the ILP speedup.
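In symbols (our notation, not the paper's):

    \text{ILP speedup} = \frac{T_{\mathrm{Simple}}}{T_{\mathrm{ILP}}}

where T_Simple and T_ILP are the execution times of the same application and input on the two systems; the same ratio restricted to a single component of execution time gives that component's ILP speedup, used below.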
To study the factors affecting ILP speedup, we study the components of execution time: busy time, functional-unit stall, synchronization stall, and data memory stall. However, these components are difficult to distinguish with ILP processors, as each instruction can potentially overlap its execution with both previous and following instructions. We hence adopt the following convention, also used in other studies: if, in a given cycle, the processor retires the maximum allowable number of instructions, we count that cycle as part of busy time; otherwise, we charge that cycle to the stall-time component corresponding to the first instruction that could not be retired. Thus, the stall time for a class of instructions represents the number of cycles that instructions of that class spend at the head of the instruction window (also known as the reorder buffer or active list) before retiring.

We analyze the effect of each component of execution time by examining the ILP speedup of that component, which is the ratio of the times spent on the component with the Simple and ILP systems.
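This per-cycle accounting convention can be sketched in a few lines of C (a simplified illustration with our own names; the paper does not show the simulator's actual code):

    enum stall_class { BUSY, FU_STALL, READ_STALL, WRITE_STALL,
                       SYNC_STALL, NUM_CLASSES };

    /* Called once per simulated cycle.
     * retired    - instructions retired this cycle
     * max_retire - maximum retire rate of the processor
     * head_class - stall class of the instruction at the head of the
     *              instruction window that could not retire this cycle
     */
    void account_cycle(long counts[NUM_CLASSES],
                       int retired, int max_retire,
                       enum stall_class head_class)
    {
        if (retired == max_retire)
            counts[BUSY]++;       /* full-rate retirement: a busy cycle */
        else
            counts[head_class]++; /* charge the first instruction that
                                     could not be retired */
    }

Summing counts[c] over the run gives the stall time attributed to class c under the convention above.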
2.2 Simulated Architectures

We model 8-processor NUMA shared-memory systems, with the system nodes connected by a two-dimensional mesh. Our systems use an invalidation coherence protocol and are release-consistent. The following details the processors and memory hierarchy modeled; Figure 1 summarizes our system parameters. The extended version of this paper also includes results for … and …-processor systems and a sensitivity analysis for several parameters.

  ILP Processor
    Processor speed             … MHz
    Maximum fetch/retire rate   … instructions per cycle
    Instruction issue window    … entries
    Functional units            … integer arithmetic, … floating point,
                                … address generation
    Branch speculation depth    …
    Memory unit size            … entries
  Network parameters
    Network speed               … MHz
    Network width               … bits
    Flit delay per hop          … network cycles
  Cache parameters
    Cache line size             … bytes
    L1 cache (on-chip)          direct-mapped, … K
    L1 request ports            …
    L1 hit time                 1 cycle
    Number of L1 MSHRs          …
    L2 cache (off-chip)         …-way associative, … K
    L2 request ports            1
    L2 hit time                 … cycles, pipelined
    Number of L2 MSHRs          …
    Write buffer entries        … cache lines
  Memory parameters
    Memory access time          … cycles (… ns)
    Memory transfer bandwidth   … bytes/cycle
    Memory interleaving         …-way

  Figure 1: System parameters

Processor Models

Our ILP processor resembles the MIPS R10000 processor, with 4-way issue, dynamic scheduling, non-blocking reads, register renaming, and speculative execution. Unlike the MIPS R10000, however, our processor implements release consistency. The Simple processor uses single issue, static scheduling, and blocking reads, and has the same clock speed as the ILP processor.

Most recent direct-execution simulation studies assume single-cycle latencies for all processor functional units. We choose to continue with this approximation for our Simple model, to represent currently used simulation models. To minimize sources of difference between the Simple and ILP models, we also use single-cycle functional-unit latencies for ILP processors. Nevertheless, to investigate the impact of this approximation, we simulated all our applications on an 8-processor ILP system with functional-unit latencies similar to the UltraSPARC processor. We found that the approximation has a negligible effect on all applications except Water; even with Water, our overall results continue to hold. The approximation has little impact because, in multiprocessors, memory time dominates, and ILP processors can easily overlap functional-unit latency.

For the experiments related to the validity of direct-execution simulators, we also investigate variants of the Simple model that reflect approximations for ILP-based multiprocessors made in recent literature. These are further described in Section ….

Memory Hierarchy

The ILP and Simple systems have an identical memory hierarchy with identical parameters. Each system node includes a processor with two levels of caching, a merging write buffer between the caches, and a portion of the distributed memory and directory. A split-transaction system bus connects the memory, the network interface, and the rest of the system node.

The L1 cache has … request ports, allowing it to serve up to … data requests per cycle, and is write-through with a no-write-allocate policy. The L2 cache has one request port and is a fully pipelined write-back cache with a write-allocate policy. Each cache also has an additional port for incoming coherence messages and replies. Both the L1 and L2 caches have miss status holding registers (MSHRs), which reserve space for outstanding cache misses; the L1 cache allocates MSHRs only for read misses, as it is no-write-allocate. The MSHRs support coalescing, so that multiple misses to the same line do not initiate multiple requests to lower levels of the memory hierarchy. We do not include such coalesced requests when calculating miss counts for our analysis.

We choose cache sizes commensurate with the input sizes of our applications, based on the methodology of Woo et al. Primary working sets of all our applications fit in the L1 cache, and secondary working sets of most applications do not fit in the L2 cache.
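MSHR coalescing can be sketched as follows (a simplified C illustration with our own names and an assumed MSHR count; the paper does not describe the simulator's structures at this level):

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_MSHRS  8   /* assumed for illustration; the real count is a
                              system parameter in Figure 1 */

    struct mshr {
        bool     valid;
        uint64_t line;     /* cache-line address of the outstanding miss */
    };

    /* Returns true if the miss initiates a new request to the next level
     * of the hierarchy. A miss to a line that already has an MSHR is
     * coalesced into it, generates no new request, and (per the text) is
     * not counted as a miss in the analysis.
     */
    bool handle_miss(struct mshr mshrs[NUM_MSHRS], uint64_t line)
    {
        for (int i = 0; i < NUM_MSHRS; i++)
            if (mshrs[i].valid && mshrs[i].line == line)
                return false;            /* coalesced */
        for (int i = 0; i < NUM_MSHRS; i++)
            if (!mshrs[i].valid) {
                mshrs[i].valid = true;   /* reserve space for the miss */
                mshrs[i].line = line;
                return true;             /* one request for this line */
            }
        return false;                    /* all MSHRs in use: the access
                                            must stall and retry */
    }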
2.3 Simulation Environment

We use the Rice Simulator for ILP Multiprocessors (RSIM) to simulate the ILP and Simple architectures described in Section 2.2. RSIM models the processors, memory system, and network in detail, including contention at all resources. It is driven by application executables rather than traces, allowing interactions between the processors to affect the course of the simulation. The code for the processor and cache subsystem performs cycle-by-cycle simulation and interfaces with an event-driven simulator for the network and memory system; the latter is derived from the Rice Parallel Processing Testbed (RPPT).

Since we simulate the processor in detail, our simulation times are five to ten times higher than those for an otherwise equivalent direct-execution simulator. To speed up simulation, we assume that all instructions hit in the instruction cache (with a 1-cycle hit time) and that all accesses to private data hit in the L1 data cache. These assumptions have also been made by many previous multiprocessor studies using direct execution. We do, however, model contention for processor resources and L1 cache ports due to private-data accesses.

The applications are compiled with a version of the SPARC V9 gcc compiler, modified to eliminate branch delay slots and restricted to 32-bit code, with the options -O2 -funroll-loops.

2.4 Applications

We use six applications for this study: LU, FFT, and Radix from the SPLASH-2 suite; Mp3d and Water from the SPLASH suite; and Erlebacher from the Rice parallel compiler group. We modified LU slightly to use flags instead of barriers, for better load balance. Figure 2 gives the input sizes of the applications and their execution times on a Simple uniprocessor.

  Application   Input Size                        Cycles
  LU            …-by-… matrix, block …            …
  FFT           … points                          …
  Radix         radix …K, …K keys, max …K         …
  Mp3d          … particles                       …
  Water         … molecules                       …
  Erlebacher    …-by-…-by-… cube, block …         …

  Figure 2: Application characteristics

We also study versions of LU and FFT that include ILP-specific optimizations that can be implemented in a compiler. Specifically, we use function inlining and loop interchange to schedule read misses closer to each other, so that they can be overlapped in the ILP processor. We refer to these optimized applications as LU_opt and FFT_opt.
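The effect these transformations aim for can be shown with a short C sketch (ours; not the actual LU or FFT code). What matters is not the particular loop but that independent read misses end up within a few instructions of one another, so they occupy the instruction window together and overlap:

    /* Each iteration issues two independent, likely-missing reads back
     * to back. An ILP processor can have both misses outstanding at
     * once, exposing roughly one miss latency per pair instead of two.
     */
    double sum_indirect(const double *a, const int *idx, int n)
    {
        double s0 = 0.0, s1 = 0.0;
        int i;
        for (i = 0; i + 1 < n; i += 2) {
            double x = a[idx[i]];      /* read miss 1 */
            double y = a[idx[i + 1]];  /* read miss 2, independent of x */
            s0 += x;
            s1 += y;
        }
        if (i < n)
            s0 += a[idx[i]];           /* leftover element when n is odd */
        return s0 + s1;
    }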
3 Impact of ILP on a Multiprocessor

This section describes the impact of ILP on multiprocessors by comparing the 8-processor Simple and ILP systems described in Section 2.2.

3.1 Overall Results

Figures 3(a) and 3(b) illustrate our key overall results. For each application, Figure 3(a) shows the total ILP speedup as well as the ILP speedup of the different components of execution time. The execution-time components include CPU time, data memory stalls, and synchronization stalls. Figure 3(b) indicates the relative importance of the ILP speedups of the different components by showing the time spent on each component, normalized to the total time on the Simple system. The busy and stall times are calculated as explained in Section 2.1.

All of our applications exhibit speedup with ILP processors, but the specific speedup seen varies greatly, from … in Radix to … in LU_opt. All the applications achieve similar and significant CPU ILP speedup (… to …). In contrast, the data memory ILP speedup is lower and varies greatly across the applications, from a slowdown in Radix to … in LU_opt. (We chose to combine the busy time and functional-unit (FU) stalls together into CPU time when computing ILP speedups, because the Simple processor does not see any FU stalls.)

The key effect of the high CPU ILP speedups and low data memory ILP speedups is that data memory time becomes more dominant in ILP multiprocessors than in Simple multiprocessors. Further, since CPU ILP speedups are fairly consistent across all applications, and data memory time is the only other dominant component of execution time, the data memory ILP speedup primarily shapes the overall ILP speedups of our applications. We therefore analyze the factors that influence data memory ILP speedup in greater detail in Section 3.2.

Synchronization ILP speedup is also low and varies widely across applications. However, since synchronization does not account for a large portion of the execution time, it does not greatly influence the overall ILP speedup. Section … discusses the factors affecting synchronization ILP speedup in our applications.

3.2 Data Memory ILP Speedup

We first discuss the various factors that can contribute to data memory ILP speedup (Section 3.2.1) and then show how these factors interact in our applications (Section 3.2.2).

3.2.1 Contributing Factors

Figure 3(b) shows that memory time is dominated by read-miss time in all of our applications. We therefore focus on the factors influencing read-miss ILP speedup; these factors are summarized in Figure 4.

The read-miss ILP speedup is the ratio of the total stall times due to read misses in the Simple and ILP systems. The total stall time due to read misses in a given system is simply the product of the average number of L1 misses and the average exposed, or unoverlapped, L1 cache miss latency. Equation (1) in Figure 4 uses the above terms to express the read-miss ILP speedup and isolates two contributing factors: the miss factor and the unoverlapped factor.

Miss factor. This is the first factor isolated in Equation (1). It specifies the ratio of the miss counts in the Simple and ILP systems. These miss counts can differ, since reordering and speculation in the ILP processor can alter the cache-miss behavior. A miss factor greater than 1 thus contributes positively to read-miss ILP speedup, as the ILP system sees fewer misses than the Simple system.

Unoverlapped factor. This is the second factor isolated in Equation (1). It specifies the ratio of the exposed, or unoverlapped, miss latency in the ILP and Simple systems. The lower the unoverlapped factor, the higher the read-miss ILP speedup. In the Simple system, the entire L1 miss latency is unoverlapped. To understand the factors contributing to unoverlapped latency in the ILP system, Equation …
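Equation (1) is not reproduced in the text above; from the definitions given, it has the form (our notation):

    \text{read-miss ILP speedup}
      = \frac{M_{\mathrm{Simple}} \times \bar{L}_{\mathrm{Simple}}}
             {M_{\mathrm{ILP}} \times \bar{L}^{\mathrm{exposed}}_{\mathrm{ILP}}}
      = \underbrace{\frac{M_{\mathrm{Simple}}}{M_{\mathrm{ILP}}}}_{\text{miss factor}}
        \times
        \left( \underbrace{\frac{\bar{L}^{\mathrm{exposed}}_{\mathrm{ILP}}}
                                {\bar{L}_{\mathrm{Simple}}}}_{\text{unoverlapped factor}} \right)^{-1}

Here M is the number of L1 read misses and L-bar the average per-miss latency; in the Simple system the exposed latency equals the full latency. A miss factor above 1 and a low unoverlapped factor both raise the speedup, matching the two factors described above.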
